R Markdown

1. Load the packages tidytext and tidyverse

library(tidyverse)
library(tidytext)

Step 1: tidyverse is loaded for data import and manipulation, and tidytext for tokenizing text and for its stop-word list.

  1. Read in the following dataset in R (the dataset is available on Canvas). This dataset contains reviews and summaries of beauty products on Amazon.
#read in the data
reviews <- read_csv("amazonbeauty_review.csv")
## Rows: 1150 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): reviewerID, asin, reviewerName, reviewText, summary, reviewTime
## dbl (4): helpful__001, helpful__002, overall, unixReviewTime
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Step 2: The Amazon reviews dataset is read into the workspace as reviews.

  1. Describe the variables in the data set and number of observations.

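Step 3: The data set has 1150 observations and 10 variables. Six variables are character columns (reviewerID, asin, reviewerName, reviewText, summary, reviewTime) and four are numeric (helpful__001, helpful__002, overall, unixReviewTime). The reviewText variable holds the full text of each review.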

  1. Create a tidytext dataset - Tokenize by bigrams.
bigrams <- reviews %>%
  unnest_tokens(bigram, reviewText, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram))

bigrams

Step 4: A tidy dataset is created by tokenizing the reviewText column into bigrams (pairs of consecutive words), one bigram per row, and filtering out any NA values.
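
To make the tokenization concrete, here is a minimal sketch on a toy table. The toy tibble and its contents are hypothetical and are not taken from the Amazon file. It illustrates that unnest_tokens() lowercases the text by default and slides a two-word window across it, and that a text with fewer than two words may yield an NA bigram, which is why the NA filter is applied.

```r
library(dplyr)
library(tidytext)

# Toy data (hypothetical): one four-word review and one single-word review
toy <- tibble(id = 1:2, reviewText = c("This cream is great", "Nice"))

toy_bigrams <- toy %>%
  # slide a window of 2 words over each review, one bigram per row
  unnest_tokens(bigram, reviewText, token = "ngrams", n = 2) %>%
  # a review too short to form a bigram may produce NA; drop it
  filter(!is.na(bigram))

toy_bigrams$bigram
# "this cream" "cream is" "is great"
```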

  1. Separate the bigrams and preprocess the text. Filter stop-words and words less than 3 characters.
bigrams_separated <- bigrams %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    # keep only pairs where both words have at least 3 characters
    filter(nchar(word1) >= 3, nchar(word2) >= 3)



bigrams_filtered <- bigrams_separated %>%
    filter(!word1 %in% stop_words$word) %>%
    filter(!word2 %in% stop_words$word)

bigrams_filtered

Count the most common bigrams. (10 points)

bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

bigram_counts

Step 5: Each bigram is separated into two columns, word1 and word2, one word each. Both columns are filtered to remove words shorter than 3 characters and all stop words. The remaining word pairs are then counted and sorted in decreasing order, which shows how many times each pair appears.
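
The two filters in this step can be sketched on a small, hypothetical table of word pairs (the toy_pairs data below is made up for illustration). The sketch assumes the length rule is "keep words with at least 3 characters", checked with nchar() on the word itself, and uses the stop_words table that ships with tidytext.

```r
library(dplyr)
library(tidytext)  # provides the stop_words data frame

# Toy word pairs (hypothetical)
toy_pairs <- tibble(
  word1 = c("dry",  "of",  "my", "skin"),
  word2 = c("skin", "the", "ok", "care")
)

toy_filtered <- toy_pairs %>%
  # length rule: both words need at least 3 characters
  filter(nchar(word1) >= 3, nchar(word2) >= 3) %>%
  # stop-word rule: neither word may appear in the stop-word list
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word)

toy_filtered
```

Only the pairs dry/skin and skin/care remain: "my ok" fails the length rule, and "of the" is removed as stop words.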

  1. Load the package igraph:
library(igraph)

Step 6: A graphing package is loaded.

  1. Use the output from step 5 and build a network of common bigrams [filter for only relatively common combinations (based on n); use lines instead of directed arrows between nodes (graph_from_data_frame(directed = FALSE))].
# filter for only relatively common combinations
bigram_graph <- bigram_counts %>%
    filter(n > 20) %>%
    graph_from_data_frame(directed = FALSE)

bigram_graph
## IGRAPH 49016fd UN-- 21 14 -- 
## + attr: name (v/c), n (e/n)
## + edges from 49016fd (vertex names):
##  [1] sensitive  --skin      dry        --skin      curling    --iron     
##  [4] alpha      --hydrox    oily       --skin      highly     --recommend
##  [7] hand       --cream     fragrance  --free      oil        --free     
## [10] acne       --prone     prone      --skin      buf        --puf      
## [13] combination--skin      skin       --care

Step 7: We keep only the most common bigrams, those that appeared more than 20 times across the reviews, and build an undirected graph from them. The resulting graph has 21 nodes (words) and 14 edges (bigrams).
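
As a rough illustration of how a count table becomes a graph, the sketch below uses a hypothetical toy_counts table (the counts are invented). graph_from_data_frame() treats the first two columns as the two ends of each edge and stores any remaining columns, here n, as edge attributes.

```r
library(igraph)

# Toy bigram counts (hypothetical values)
toy_counts <- data.frame(
  word1 = c("dry",  "oily", "skin", "hand"),
  word2 = c("skin", "skin", "care", "cream"),
  n     = c(30, 25, 22, 5)
)

# Keep only pairs that occur more than 20 times
common <- toy_counts[toy_counts$n > 20, ]

# Build an undirected graph: columns 1-2 are edges, n becomes an edge attribute
toy_graph <- graph_from_data_frame(common, directed = FALSE)

vcount(toy_graph)       # number of distinct words (nodes)
ecount(toy_graph)       # number of bigrams kept (edges)
is_directed(toy_graph)  # FALSE for an undirected graph
E(toy_graph)$n          # the counts carried along on each edge
```

The rare pair hand/cream is dropped by the threshold, so the graph has four nodes and three edges.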

  1. Load the package ggraph
library(ggraph)

Step 8: We load ggraph, which draws igraph objects using ggplot2-style layers, for the next step.

  1. Visualize the graph - Use the Fruchterman-Reingold layout (“fr”) to visualize the nodes and ties. Apply some polishing operations to make a better-looking graph.
set.seed(2017)

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), show.legend = FALSE, edge_colour = "cyan4") +
  geom_node_point(size = 1) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()
## Warning: The `trans` argument of `continuous_scale()` is deprecated as of ggplot2 3.5.0.
## ℹ Please use the `transform` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

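Step 9: We first set a seed so the plot is reproducible, because the Fruchterman-Reingold layout starts from random node positions. We then visualize the network graph with plain lines instead of arrows.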

The graph shows the most common bigrams in the reviews data set. Each pair shown occurs more than 20 times and contains no stop words. In the reviews, ‘highly recommend’ is a common phrase for endorsing products. After cleaning and visualizing the bigrams, the most common pairs clearly cluster around skin (‘sensitive skin’, ‘dry skin’, ‘oily skin’), hair (‘curling iron’), and descriptive terms such as ‘free’ and ‘prone’ (‘fragrance free’, ‘oil free’, ‘acne prone’).
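
One way to back up this reading is to count how many bigrams each word takes part in (its degree in the graph). The sketch below re-enters the 14 edges printed in the step 7 output by hand, so it runs without the original data; the object names are illustrative.

```r
library(igraph)

# The 14 edges listed in the printed bigram_graph output
edges <- data.frame(
  word1 = c("sensitive", "dry", "curling", "alpha", "oily", "highly", "hand",
            "fragrance", "oil", "acne", "prone", "buf", "combination", "skin"),
  word2 = c("skin", "skin", "iron", "hydrox", "skin", "recommend", "cream",
            "free", "free", "prone", "skin", "puf", "skin", "care")
)

g <- graph_from_data_frame(edges, directed = FALSE)

# Degree = number of bigrams a word appears in; sort to find the hubs
deg <- sort(degree(g), decreasing = TRUE)
head(deg, 3)
```

Here ‘skin’ takes part in 6 of the 14 bigrams, while ‘free’ and ‘prone’ each take part in 2, which matches the clusters visible in the plot.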